Basic EM workflow 1 (Restaurants data set)

Introduction

This IPython notebook explains a basic workflow to match two tables using py_entitymatching. Our goal is to match restaurants from Fodors and Zagat sites. The datasets contain information about the restaurants.

First, we need to import py_entitymatching package and other libraries as follows:


In [1]:
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')

import py_entitymatching as em
import pandas as pd
import os

In [2]:
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )


python version: 3.5.2 | packaged by conda-forge | (default, Sep  8 2016, 14:36:38) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]
pandas version: 0.19.2
magellan version: 0.1.0

Matching two tables typically consists of the following three steps:

1. Reading the input tables

2. Blocking the input tables to get a candidate set

3. Matching the tuple pairs in the candidate set

Read input tables

We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.


In [3]:
# Get the paths
path_A = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/fodors.csv'
path_B = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/zagats.csv'

In [4]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')


Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.

In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))


Number of tuples in A: 533
Number of tuples in B: 331
Number of tuples in A X B (i.e the cartesian product): 176423

In [6]:
A.head()


Out[6]:
id name addr city phone type
0 534 arnie mortons of chicago 435 s. la cienega blv. los angeles 310/246-1501 american
1 535 arts delicatessen 12224 ventura blvd. studio city 818/762-1221 american
2 536 hotel bel-air 701 stone canyon rd. bel air 310/472-1211 californian
3 537 cafe bizou 14016 ventura blvd. sherman oaks 818/788-3536 french
4 538 campanile 624 s. la brea ave. los angeles 213/938-1447 american

In [7]:
B.head()


Out[7]:
id name addr city phone type
0 1 apple pan the 10801 w. pico blvd. west la 310-475-3585 american
1 2 asahi ramen 2027 sawtelle blvd. west la 310-479-2231 noodle shops
2 3 baja fresh 3345 kimber dr. westlake village 805-498-4049 mexican
3 4 belvedere the 9882 little santa monica blvd. beverly hills 310-788-2306 pacific new wave
4 5 benitas frites 1433 third st. promenade santa monica 310-458-2889 fast food

In [8]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)


Out[8]:
('id', 'id')

Block tables to get candidate set

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching. py_entitymatching provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to input tables.

For the matching problem at hand, we know that two restaurants with no overlap between the names will not match. So we decide the apply blocking over names:


In [9]:
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'name', 'name', 
                    l_output_attrs=['name', 'addr', 'city', 'phone'], 
                    r_output_attrs=['name', 'addr', 'city', 'phone'],
                    overlap_size=1, show_progress=False)

In [10]:
C.head()


Out[10]:
_id ltable_id rtable_id ltable_name ltable_addr ltable_city ltable_phone rtable_name rtable_addr rtable_city rtable_phone
0 0 1036 1 pacific pan pacific hotel 500 post st. san francisco 415/929-2087 apple pan the 10801 w. pico blvd. west la 310-475-3585
1 1 1016 7 hyde street bistro 1521 hyde st. san francisco 415/441-7778 bistro 45 45 s. mentor ave. pasadena 818-795-2478
2 2 552 7 pinot bistro 12969 ventura blvd. los angeles 818/990-0500 bistro 45 45 s. mentor ave. pasadena 818-795-2478
3 3 652 7 bistro garden 176 n. canon dr. los angeles 310/550-3900 bistro 45 45 s. mentor ave. pasadena 818-795-2478
4 4 991 7 bistro roti 155 steuart st. san francisco 415/495-6500 bistro 45 45 s. mentor ave. pasadena 818-795-2478

Match tuple pairs in candidate set

In this step, we would want to match the tuple pairs in the candidate set. Specifically, we use learning-based method for matching purposes. This typically involves the following four steps:

  1. Sampling and labeling the candidate set
  2. Train matcher using labeled data
  3. Predict the matches in the candidate set using trained matcher

Sampling and labeling the candidate set

First, we randomly sample 450 tuple pairs for labeling purposes.


In [11]:
# Sample  candidate set
S = em.sample_table(C, 450)

Next, we label the sampled candidate set. Specify we would enter 1 for a match and 0 for a non-match.


In [12]:
# Label S
G = em.label_table(S, 'gold')

For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package.


In [13]:
path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/lbl_restnt_wf1.csv'
G = em.read_csv_metadata(path_G, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)


Out[13]:
450

Train matcher using labeled data

First, we need to create a set of features.py_entitymatching provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.


In [14]:
# Generate features automatically 
feature_table = em.get_features_for_matching(A, B)

Next, we convert the labeled data to feature vectors using the feature table


In [15]:
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_name', 'ltable_addr', 'ltable_city', 'ltable_phone',
                    'rtable_name', 'rtable_addr', 'rtable_city', 'rtable_phone']
# Convert the labeled data to feature vectors using the feature table
H = em.extract_feature_vecs(G, 
                            feature_table=feature_table, 
                            attrs_before = attrs_from_table,
                            attrs_after='gold',
                            show_progress=False)

Then, we train the learning-based matcher using the feature vectors. For the purposes of the guide, we will use Random Forest matcher that is included in the py_entitymatching package.


In [16]:
# Instantiate the RF Matcher
rf = em.RFMatcher()

In [17]:
# Get the attributes to be projected while training
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id', 'gold'])
attrs_to_be_excluded.extend(attrs_from_table)

In [18]:
# Train using feature vectors from the labeled data.
rf.fit(table=H, exclude_attrs=attrs_to_be_excluded, target_attr='gold')

Predict the matches in the candidate set using trained matcher

Now, we use the trained matcher to predict matches in the candidate set. To do that, first we need to convert the candidate set to feature vectors.


In [19]:
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_name', 'ltable_addr', 'ltable_city', 'ltable_phone',
                    'rtable_name', 'rtable_addr', 'rtable_city', 'rtable_phone']
# Convert the cancidate set to feature vectors using the feature table
L = em.extract_feature_vecs(C, feature_table=feature_table,
                             attrs_before= attrs_from_table,
                             show_progress=False)

Next, we predict the matches in the candidate set using the trained matcher and the feature vectors.


In [20]:
# Get the attributes to be excluded while predicting 
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_to_be_excluded.extend(attrs_from_table)

In [21]:
# Predict the matches
predictions = rf.predict(table=L, exclude_attrs=attrs_to_be_excluded,                          
              append=True, target_attr='predicted', inplace=False)

In [22]:
predictions.head()


Out[22]:
_id ltable_id rtable_id ltable_name ltable_addr ltable_city ltable_phone rtable_name rtable_addr rtable_city ... city_city_sw type_type_jac_qgm_3_qgm_3 type_type_cos_dlm_dc0_dlm_dc0 type_type_jac_dlm_dc0_dlm_dc0 type_type_mel type_type_lev_dist type_type_lev_sim type_type_nmw type_type_sw predicted
0 0 1036 1 pacific pan pacific hotel 500 post st. san francisco 415/929-2087 apple pan the 10801 w. pico blvd. west la ... 1.0 0.000000 0.0 0.0 0.513889 6.0 0.250000 0.0 1.0 0
1 1 1016 7 hyde street bistro 1521 hyde st. san francisco 415/441-7778 bistro 45 45 s. mentor ave. pasadena ... 2.0 0.222222 0.0 0.0 0.634199 7.0 0.363636 0.0 3.0 0
2 2 552 7 pinot bistro 12969 ventura blvd. los angeles 818/990-0500 bistro 45 45 s. mentor ave. pasadena ... 1.0 0.000000 0.0 0.0 0.424242 9.0 0.181818 -3.0 2.0 0
3 3 652 7 bistro garden 176 n. canon dr. los angeles 310/550-3900 bistro 45 45 s. mentor ave. pasadena ... 1.0 1.000000 1.0 1.0 1.000000 0.0 1.000000 11.0 11.0 0
4 4 991 7 bistro roti 155 steuart st. san francisco 415/495-6500 bistro 45 45 s. mentor ave. pasadena ... 2.0 0.000000 0.0 0.0 0.424242 9.0 0.181818 -3.0 2.0 0

5 rows × 48 columns

Finally, project the attributes and the predictions from the predicted table.


In [23]:
# Get the attributes to be projected out
attrs_proj = []
attrs_proj.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_proj.extend(attrs_from_table)
attrs_proj.append('predicted')

# Project the attributes
predictions = predictions[attrs_proj]

In [24]:
predictions.head()


Out[24]:
_id ltable_id rtable_id ltable_name ltable_addr ltable_city ltable_phone rtable_name rtable_addr rtable_city rtable_phone predicted
0 0 1036 1 pacific pan pacific hotel 500 post st. san francisco 415/929-2087 apple pan the 10801 w. pico blvd. west la 310-475-3585 0
1 1 1016 7 hyde street bistro 1521 hyde st. san francisco 415/441-7778 bistro 45 45 s. mentor ave. pasadena 818-795-2478 0
2 2 552 7 pinot bistro 12969 ventura blvd. los angeles 818/990-0500 bistro 45 45 s. mentor ave. pasadena 818-795-2478 0
3 3 652 7 bistro garden 176 n. canon dr. los angeles 310/550-3900 bistro 45 45 s. mentor ave. pasadena 818-795-2478 0
4 4 991 7 bistro roti 155 steuart st. san francisco 415/495-6500 bistro 45 45 s. mentor ave. pasadena 818-795-2478 0